Introduction to Open Data Science - Course Project

1st excersice

# This is a so-called "R chunk" where you can write R code.

date()
## [1] "Mon Dec 11 09:41:33 2023"

My experience with RStudio

I feel fairly comfortable working with RStudio as it is one of the main tools I use daily. However, after browsing trough the “R for health Data Science” I feel there are several ways to use R. I tend to structure my code in rather different fashion. Perhaps the most striking difference is the use of pipeline. Example file uses pipelines everywhere! where as I tend to avoid it as i feel it makes the code hard to follow even if it would shorten and simplify the code. For example if we create random data frame with tree columns

exampledata1=c(2,4,6,6,8,9,10)
exampledata2=c(70.2,44.0,62.7,106.1,81.1,96.1,10.8)
exampledata3=c(16,23,10,12,9,10,17)
df=data.frame(exampledata1,exampledata2,exampledata3)
print(df)
##   exampledata1 exampledata2 exampledata3
## 1            2         70.2           16
## 2            4         44.0           23
## 3            6         62.7           10
## 4            6        106.1           12
## 5            8         81.1            9
## 6            9         96.1           10
## 7           10         10.8           17

and we want to caluclate mean of the 3rd column, I would use following command:

mean(df$exampledata3)
## [1] 13.85714

whereas excercise dataset would use command

df$exampledata3 %>% mean()
## [1] 13.85714

Both ways lead to same results, even if syntax is different.

Another difference is plotting. Excercise uses ggplot2 where I am used to use R base plots. I know ggplot2 is more versatile and widely used tool. How ever i find its logic confusing compared to base plots. I wish i will learn to use ggplot2 during this course.

Overall i cannot say how well exercise set worked as crash course as most of this was already familiar to me. How ever the parts about ggplot2 was my favorite as it helped me to understand logic behind commands. Using markup is not familiar to me - i have only used Rscripts earlier.

Chapter 2: Linear regression

date()
## [1] "Mon Dec 11 09:41:34 2023"

Data

First step is to read the data created during the data wrangling part. In this case it is stored on my computer.

learningdata=read.csv("C:/Users/03196349/Work Folders/PhD opinnot/IODS/IODS-project/data/learning2014.csv")
head(learningdata)
##   X gender Age Attitude     deep  stra     surf Points
## 1 1      F  53       37 3.583333 3.375 2.583333     25
## 2 2      M  55       31 2.916667 2.750 3.166667     12
## 3 3      F  49       25 3.500000 3.625 2.250000     24
## 4 4      M  53       35 3.500000 3.125 2.250000     10
## 5 5      M  49       37 3.666667 3.625 2.833333     22
## 6 6      F  38       38 4.750000 3.625 2.416667     21

Data appears to be in good order. After importing data it is possible to inspect it.

str(learningdata)
## 'data.frame':    166 obs. of  8 variables:
##  $ X       : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ gender  : chr  "F" "M" "F" "M" ...
##  $ Age     : int  53 55 49 53 49 38 50 37 37 42 ...
##  $ Attitude: int  37 31 25 35 37 38 35 29 38 21 ...
##  $ deep    : num  3.58 2.92 3.5 3.5 3.67 ...
##  $ stra    : num  3.38 2.75 3.62 3.12 3.62 ...
##  $ surf    : num  2.58 3.17 2.25 2.25 2.83 ...
##  $ Points  : int  25 12 24 10 22 21 21 31 24 26 ...

There are 8 different variables (X, gender, age, attitude, deep, stra, surf and points). Data includes different variables related to learning. X is number used to identify observations. It is irrelevant for the data exploration as it is not measurement etc. Just a tool to identify different persons. Therefore it can be excluded from graphical investigation. Scaterplot matrix is a useful tool to understand different relationships between variables. Gender could be treated as dummy-varable as it is ether male of female in this data. To clarify the visuals we can use different collours on both gender.

library(GGally)
## Warning: package 'GGally' was built under R version 4.2.3
## Loading required package: ggplot2
## Warning: package 'ggplot2' was built under R version 4.2.3
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
library(ggplot2)
p <- ggpairs(learningdata[-1], mapping = aes(alpha=0.3, col=gender), lower = list(combo = wrap("facethist", bins = 20)))
p

Correlation between variables can be seen in the top/right part of the graph (total and gender separately). left/bottom part of the graph shows scatter plot. Axis on th middle shows the distribution of observations. Top row is a boxplot on effect of gender to different variables. left column is histogram of variable by gender. Top left corner is histogram of genders. There are several aspects we can observe from this one graph.

  • There are more females than males in the data set
    • Most of the test subjects are young: about 20-25 years old
      • Males are usually slightly older.
      • There is a lack of observations of middle aged men.
    • On average males have higher attitude score than women.
    • There are little to no difference in deep learning between genders.
    • The strategic and surface learning score is slightly higher on women.
    • On average there are no differences between scores, but the top scores are slightly male dominated.
  • For the most parts there are low correlation between different variables. How ever there are expections:
    • Points and gender have positive correlation (0.44)
    • Surface and deep learining has negative correlation (-0.32)
      • For males correlation is high, over -0.6, where on women it is just -0.09

Modeling

To create model we should select variables that have high correlation with explained variable (points in this case). How ever if 2 variables have high correlation with explained variable, but they are also highly correlated with each other, they should be not both used. In this case points are correlated with attitude. Other attributes have quite low correlation, but variables stra and surf has some level correlation with points, so we add them to model as well.

model = lm(Points ~ Attitude + stra +surf, data = learningdata)
summary(model)
## 
## Call:
## lm(formula = Points ~ Attitude + stra + surf, data = learningdata)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.1550  -3.4346   0.5156   3.6401  10.8952 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 11.01711    3.68375   2.991  0.00322 ** 
## Attitude     0.33952    0.05741   5.913 1.93e-08 ***
## stra         0.85313    0.54159   1.575  0.11716    
## surf        -0.58607    0.80138  -0.731  0.46563    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared:  0.2074, Adjusted R-squared:  0.1927 
## F-statistic: 14.13 on 3 and 162 DF,  p-value: 3.156e-08

According to the summary table, surf has extraordinary large p-value and therefore model can be simplified.

model2 = lm(Points ~ Attitude + stra, data = learningdata)
summary(model2)
## 
## Call:
## lm(formula = Points ~ Attitude + stra, data = learningdata)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.6436  -3.3113   0.5575   3.7928  10.9295 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  8.97290    2.39591   3.745  0.00025 ***
## Attitude     0.34658    0.05652   6.132 6.31e-09 ***
## stra         0.91365    0.53447   1.709  0.08927 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.289 on 163 degrees of freedom
## Multiple R-squared:  0.2048, Adjusted R-squared:  0.1951 
## F-statistic: 20.99 on 2 and 163 DF,  p-value: 7.734e-09

If we drop the surf variable form the model, we get model more simple model with higher r-squared and smaller p-value. Now the instructions of this exercise advise that non significant terms should be removed. How ever the question of significance is not black and white. It is more about how much uncertainty we are ready to accept. In model2 variable stra has p-value if 0.089. If we use 95% confidence level, the term is ot significant (p > 0.05) but if we use 90% confidence, it is (p < 0,1). Luckily we have more tools as decide whether or not we wish to simplify the model further.

model3 = lm(Points ~ Attitude, data = learningdata)
summary(model3)
## 
## Call:
## lm(formula = Points ~ Attitude, data = learningdata)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.9763  -3.2119   0.4339   4.1534  10.6645 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 11.63715    1.83035   6.358 1.95e-09 ***
## Attitude     0.35255    0.05674   6.214 4.12e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared:  0.1906, Adjusted R-squared:  0.1856 
## F-statistic: 38.61 on 1 and 164 DF,  p-value: 4.119e-09

If we simplify the model even further we start to loose some of validity of the model as adjusted p-value starts to decrease and therefore we might not want to only have one explanatory variable and therefore we are happy with model2.

We can further analyze model variables using Bayesian Information Criterion (BIC)

library(flexmix)
## Warning: package 'flexmix' was built under R version 4.2.3
## Loading required package: lattice
BIC(model)
## [1] 1046.051
BIC(model2)
## [1] 1041.486
BIC(model3)
## [1] 1039.323

According to BIC the model with only attitude as predictor is the best despite lower R-squared.

We can yet further use tools to compare different models. Command “step” drops predictors one by one until the smallest AIC value has reaches and further simplification would increase the AIC.

step(model)
## Start:  AIC=557.4
## Points ~ Attitude + stra + surf
## 
##            Df Sum of Sq    RSS    AIC
## - surf      1     15.00 4559.4 555.95
## <none>                  4544.4 557.40
## - stra      1     69.61 4614.0 557.93
## - Attitude  1    980.95 5525.3 587.85
## 
## Step:  AIC=555.95
## Points ~ Attitude + stra
## 
##            Df Sum of Sq    RSS    AIC
## <none>                  4559.4 555.95
## - stra      1     81.74 4641.1 556.90
## - Attitude  1   1051.89 5611.3 588.41
## 
## Call:
## lm(formula = Points ~ Attitude + stra, data = learningdata)
## 
## Coefficients:
## (Intercept)     Attitude         stra  
##      8.9729       0.3466       0.9137

Smallest AIC score is achieved with Attitude and stra as predictors (=model2), suggesting it is the best model. It probably could be argued that model2 and model3 are both valid, but i would prefer model2. So let’s take a closer look on that.

summary(model2)
## 
## Call:
## lm(formula = Points ~ Attitude + stra, data = learningdata)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.6436  -3.3113   0.5575   3.7928  10.9295 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  8.97290    2.39591   3.745  0.00025 ***
## Attitude     0.34658    0.05652   6.132 6.31e-09 ***
## stra         0.91365    0.53447   1.709  0.08927 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.289 on 163 degrees of freedom
## Multiple R-squared:  0.2048, Adjusted R-squared:  0.1951 
## F-statistic: 20.99 on 2 and 163 DF,  p-value: 7.734e-09

Model validication

According to summary table model intercept is 8.97. Parameter for attitude is 0.35 and for stra 0.91. So the formula for model would be \(Points=8.97+attitude*0.35+stra*0.91\). This means that 1 unit increase in attitude, increases points 0.35 and one unit increase of stra increases the points by 0.91. If both attetude and stra is 0, model suggests that test subject recive about than 9 points.

R-squared explains how much variation of dependent variable is explained by independent variable. .Multiple R-squared explains how much variation in dependent variable is explained by the model variables. 1 would mean all variation is explained by it and 0 would mean no variation is explained. Multiple \(R^2\) of this model is 0.20, suggesting that 20 % of the variation is explained by the independent variables. In addition summary table provides a adjusted \(R^2\). This adjustment makes it easier to compare models with different ammounts of predictorors, as higher number of predictors tend to increase \(R^2\) even if increasing the number if predictors would not increase the quality of the model as predictors will explain some percent of variance - even by coincidence.

plot(model2, which = 1)

Residuals vs fitted values are used to check for un-linearity. Residuals describe the distance between observation and model. If residuals have increasing or decreasing trend, we have issues with the model. Ie. we should have used nonlinear regression instead of linear. We can also detect outliers using this plot. the average of the residuals are close to zero trough the graph and there are no clear patterns so all is good. How ever we can detect there are some outliers in the data illustrated at the lower middle part of the graph.

plot(model2, which = 2)

Normal Q-Q plot sorts observations by residuals into ascending order. Majority of observations are where one would expect them to be in normally distributed data. Only the tails are a bit off from expected values. How ever as we have limited sample size I would not be too worried about that. As a conclusion: data is or is close to be normally distributed.

plot(model2, which = 5)

Residuals vs Leverage is used to find influential observations. High leverage means changes in that observations would affect the model more compared to observations with low leverage. In this case it seems there are no highly influential observations as all of them are within cook’s distance (we can’t even see the threshold in this zoom). We can also see that residuals do not really change as a function of a leverage, indicating yet again normal distribution and that variances of observations have no patterns.



Chapter 3: Linear regression

Disclaimer! I have had super busy week and unfortunately I did not have enough time to really focus this week and I am coding this last minute before deadline. So if and when i have made typos and silly mistakes, please be patient and I am sorry.

date()
## [1] "Mon Dec 11 09:41:53 2023"

Data

First step is to read the data created during the data wrangling part. In this case it is stored on my computer.

alc=read.csv("C:/Users/03196349/Work Folders/PhD opinnot/IODS/IODS-project/data/alcdata.csv")
head(alc)
##   X school sex age address famsize Pstatus Medu Fedu     Mjob     Fjob
## 1 1     GP   F  18       U     GT3       A    4    4  at_home  teacher
## 2 2     GP   F  17       U     GT3       T    1    1  at_home    other
## 3 3     GP   F  15       U     LE3       T    1    1  at_home    other
## 4 4     GP   F  15       U     GT3       T    4    2   health services
## 5 5     GP   F  16       U     GT3       T    3    3    other    other
## 6 6     GP   M  16       U     LE3       T    4    3 services    other
##       reason guardian traveltime studytime schoolsup famsup activities nursery
## 1     course   mother          2         2       yes     no         no     yes
## 2     course   father          1         2        no    yes         no      no
## 3      other   mother          1         2       yes     no         no     yes
## 4       home   mother          1         3        no    yes        yes     yes
## 5       home   father          1         2        no    yes         no     yes
## 6 reputation   mother          1         2        no    yes        yes     yes
##   higher internet romantic famrel freetime goout Dalc Walc health failures paid
## 1    yes       no       no      4        3     4    1    1      3        0   no
## 2    yes      yes       no      5        3     3    1    1      3        0   no
## 3    yes      yes       no      4        3     2    2    3      3        2  yes
## 4    yes      yes      yes      3        2     2    1    1      5        0  yes
## 5    yes       no       no      4        3     2    1    2      5        0  yes
## 6    yes      yes       no      5        4     2    1    2      5        0  yes
##   absences G1 G2 G3 alc_use high_use
## 1        5  2  8  8     1.0    FALSE
## 2        3  7  8  8     1.0    FALSE
## 3        8 10 10 11     2.5     TRUE
## 4        1 14 14 14     1.0    FALSE
## 5        2  8 12 12     1.5    FALSE
## 6        8 14 14 14     1.5    FALSE

Let’s remove the first column “x” as it is duplicate of row numbers

alc=alc[,-1]

Let’s take a look on our data

colnames(alc)
##  [1] "school"     "sex"        "age"        "address"    "famsize"   
##  [6] "Pstatus"    "Medu"       "Fedu"       "Mjob"       "Fjob"      
## [11] "reason"     "guardian"   "traveltime" "studytime"  "schoolsup" 
## [16] "famsup"     "activities" "nursery"    "higher"     "internet"  
## [21] "romantic"   "famrel"     "freetime"   "goout"      "Dalc"      
## [26] "Walc"       "health"     "failures"   "paid"       "absences"  
## [31] "G1"         "G2"         "G3"         "alc_use"    "high_use"

We have 35 columns describing information about students. Some variables describe background information. Variables G1, G2 and G3 decribes the grades of testsubjects. Alc_use tells how large is alcohol consumption and high use tells whether the consumption is high or not.


The 4 variables i would like to choose are absence, G3, pstatus and age. I assume that high absence, low grades, broken families and older age is linked to high consumption of alcohol.

First I would like to check how those variables are related to each other. But before that, let’s instal some packages!

library(tidyr); library(dplyr); library(ggplot2)
## Warning: package 'tidyr' was built under R version 4.2.3
## 
## Attaching package: 'tidyr'
## The following object is masked from 'package:magrittr':
## 
##     extract
table(alc$Pstatus, alc$high_use)
##    
##     FALSE TRUE
##   A    26   12
##   T   233   99
g1 <- ggplot(alc, aes(x = high_use, y = age))
g2 <- ggplot(alc, aes(x = high_use, y = G3))
g3 <- ggplot(alc, aes(x = high_use, y = absences))


g1 + geom_boxplot()

g2 + geom_boxplot()

g3 + geom_boxplot()

Using tables and boxplots we can assume

  • high consumption of alcohol is slightly more common in broken families
  • Young people use less alcohol
  • People with high grades have smaller alcohol consumption
  • people with high alcohol consumption are more often absence

Therefore hypotheses presented seem to be correct.


m <- glm(high_use ~ G3 + absences + age + Pstatus, data = alc, family = "binomial")
summary(m)
## 
## Call:
## glm(formula = high_use ~ G3 + absences + age + Pstatus, family = "binomial", 
##     data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.3352  -0.8239  -0.7050   1.2037   1.8781  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -2.52154    1.82438  -1.382 0.166931    
## G3          -0.06845    0.03602  -1.900 0.057371 .  
## absences     0.08199    0.02335   3.511 0.000446 ***
## age          0.12084    0.10147   1.191 0.233666    
## PstatusT     0.05517    0.39732   0.139 0.889572    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 452.04  on 369  degrees of freedom
## Residual deviance: 428.47  on 365  degrees of freedom
## AIC: 438.47
## 
## Number of Fisher Scoring iterations: 4

According to the summary of created model only G3 and absences are statistically significant. Family status has only little to do with the alcohol consumption. More simple model would be better. For that we can use step-command that finds the model with smallest (=best) AIC score.

step(m)
## Start:  AIC=438.47
## high_use ~ G3 + absences + age + Pstatus
## 
##            Df Deviance    AIC
## - Pstatus   1   428.49 436.49
## - age       1   429.89 437.89
## <none>          428.47 438.47
## - G3        1   432.09 440.09
## - absences  1   442.85 450.85
## 
## Step:  AIC=436.49
## high_use ~ G3 + absences + age
## 
##            Df Deviance    AIC
## - age       1   429.93 435.93
## <none>          428.49 436.49
## - G3        1   432.17 438.17
## - absences  1   443.07 449.07
## 
## Step:  AIC=435.93
## high_use ~ G3 + absences
## 
##            Df Deviance    AIC
## <none>          429.93 435.93
## - G3        1   434.52 438.52
## - absences  1   445.68 449.68
## 
## Call:  glm(formula = high_use ~ G3 + absences, family = "binomial", 
##     data = alc)
## 
## Coefficients:
## (Intercept)           G3     absences  
##    -0.38732     -0.07606      0.08423  
## 
## Degrees of Freedom: 369 Total (i.e. Null);  367 Residual
## Null Deviance:       452 
## Residual Deviance: 429.9     AIC: 435.9

In this case such model would have just 2 predictors: absence and G3. So let’s create such model.

m <- glm(high_use ~ G3 + absences, data = alc, family = "binomial")
summary(m)
## 
## Call:
## glm(formula = high_use ~ G3 + absences, family = "binomial", 
##     data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.3286  -0.8298  -0.7219   1.2113   1.9242  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -0.38732    0.43500  -0.890 0.373259    
## G3          -0.07606    0.03553  -2.141 0.032311 *  
## absences     0.08423    0.02309   3.648 0.000264 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 452.04  on 369  degrees of freedom
## Residual deviance: 429.93  on 367  degrees of freedom
## AIC: 435.93
## 
## Number of Fisher Scoring iterations: 4

So according to he model the astimate for parameter G3 is -0.08, and parameter for absence is 0.084. intercept is -0.39. To calculate odds ratios, we use following command:

OR <- coef(m) %>% exp
OR
## (Intercept)          G3    absences 
##   0.6788751   0.9267616   1.0878762

Ratios describe the change of predictable variable if the explainable variable is changed one unit.

CI=confint(m)
## Waiting for profiling to be done...
CI
##                   2.5 %       97.5 %
## (Intercept) -1.25260555  0.459482521
## G3          -0.14621889 -0.006471056
## absences     0.04110528  0.131708308

This provides 95% confidence intervals. 97.5 being the upper and 2.5 the lower interval.

Hypothesis vise this would mean that age and family status does not have important role on does one consume high amounts of alcohol or not. And that high grades means less alcohol and hing absences is tied to high alcohol consumption. How eve it is unclear which one is the cause and which is the result.


alc = mutate(alc, probability = predict(m, type = "response"))
alc = mutate(alc, prediction =probability>0.5)
table(high_use = alc$high_use, prediction = alc$prediction)%>%prop.table()
##         prediction
## high_use      FALSE       TRUE
##    FALSE 0.68108108 0.01891892
##    TRUE  0.25675676 0.04324324

model was good at predicting if high use is false. How ever it also often assumed high use of alcohol even tough actually it was not the case.

g <- ggplot(alc, aes(x = probability, y = high_use, col=prediction))

# define the geom as points and draw the plot
g+geom_point()

The visual representation shows that the model often assumes High use to be false even tough it actually would be true.

loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}

loss_func(alc$high_use,alc$probability)
## [1] 0.2756757

The training error (the mean number of wrong predictions) is 0,28


library(boot)
## Warning: package 'boot' was built under R version 4.2.3
## 
## Attaching package: 'boot'
## The following object is masked from 'package:flexmix':
## 
##     boot
## The following object is masked from 'package:lattice':
## 
##     melanoma
cv <- cv.glm(data = alc, cost = loss_func, glmfit = m, K = 10)

# average number of wrong predictions in the cross validation
cv$delta[1]
## [1] 0.3

The average number of wrong predictions is 0,2811, which is a greater than error in exercise set (0.26). This means we have worse model than in exercise 3. Lets try to find better model.

m2 <- glm(high_use ~ G3 + absences+ school + sex + address+ famsize + Pstatus + Medu + Fedu + Mjob + Fjob + reason + guardian + traveltime + studytime + schoolsup+ famsup+ activities + nursery + higher + internet + romantic + famrel + freetime + goout + health + failures + paid + absences + G1 + G2, data = alc, family = "binomial")
summary(m2)
## 
## Call:
## glm(formula = high_use ~ G3 + absences + school + sex + address + 
##     famsize + Pstatus + Medu + Fedu + Mjob + Fjob + reason + 
##     guardian + traveltime + studytime + schoolsup + famsup + 
##     activities + nursery + higher + internet + romantic + famrel + 
##     freetime + goout + health + failures + paid + absences + 
##     G1 + G2, family = "binomial", data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.7620  -0.6544  -0.3590   0.4831   2.8990  
## 
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      -3.10711    1.79919  -1.727 0.084177 .  
## G3                0.05976    0.11463   0.521 0.602121    
## absences          0.09105    0.02706   3.365 0.000766 ***
## schoolMS         -0.48058    0.53496  -0.898 0.369001    
## sexM              0.84885    0.33281   2.551 0.010755 *  
## addressU         -0.85479    0.41943  -2.038 0.041552 *  
## famsizeLE3        0.51405    0.33078   1.554 0.120176    
## PstatusT         -0.29228    0.53799  -0.543 0.586937    
## Medu              0.04021    0.22256   0.181 0.856645    
## Fedu              0.14434    0.19799   0.729 0.465991    
## Mjobhealth       -1.07147    0.76687  -1.397 0.162353    
## Mjobother        -0.85282    0.50337  -1.694 0.090222 .  
## Mjobservices     -0.82754    0.58036  -1.426 0.153896    
## Mjobteacher      -0.34813    0.70825  -0.492 0.623052    
## Fjobhealth        0.33260    1.19154   0.279 0.780142    
## Fjobother         0.81603    0.86311   0.945 0.344431    
## Fjobservices      1.19959    0.88620   1.354 0.175853    
## Fjobteacher      -0.34774    1.04522  -0.333 0.739367    
## reasonhome        0.55292    0.39470   1.401 0.161262    
## reasonother       1.52294    0.54867   2.776 0.005508 ** 
## reasonreputation  0.01215    0.41787   0.029 0.976799    
## guardianmother   -0.83744    0.37652  -2.224 0.026136 *  
## guardianother    -0.13975    0.80413  -0.174 0.862026    
## traveltime        0.32446    0.24022   1.351 0.176797    
## studytime        -0.27488    0.20833  -1.319 0.187012    
## schoolsupyes      0.03707    0.47710   0.078 0.938071    
## famsupyes        -0.07901    0.33491  -0.236 0.813504    
## activitiesyes    -0.58242    0.31635  -1.841 0.065608 .  
## nurseryyes       -0.48277    0.36927  -1.307 0.191090    
## higheryes        -0.30306    0.73540  -0.412 0.680263    
## internetyes       0.09333    0.45868   0.203 0.838771    
## romanticyes      -0.44975    0.33749  -1.333 0.182662    
## famrel           -0.57891    0.17076  -3.390 0.000698 ***
## freetime          0.29873    0.17003   1.757 0.078928 .  
## goout             0.90284    0.15767   5.726 1.03e-08 ***
## health            0.18851    0.11495   1.640 0.101042    
## failures          0.33167    0.27916   1.188 0.234790    
## paidyes           0.53225    0.32442   1.641 0.100883    
## G1               -0.13568    0.12864  -1.055 0.291563    
## G2                0.10089    0.15969   0.632 0.527532    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 452.04  on 369  degrees of freedom
## Residual deviance: 310.30  on 330  degrees of freedom
## AIC: 390.3
## 
## Number of Fisher Scoring iterations: 5
step(m2)
## Start:  AIC=390.3
## high_use ~ G3 + absences + school + sex + address + famsize + 
##     Pstatus + Medu + Fedu + Mjob + Fjob + reason + guardian + 
##     traveltime + studytime + schoolsup + famsup + activities + 
##     nursery + higher + internet + romantic + famrel + freetime + 
##     goout + health + failures + paid + absences + G1 + G2
## 
##              Df Deviance    AIC
## - Mjob        4   314.58 386.58
## - schoolsup   1   310.30 388.30
## - Medu        1   310.33 388.33
## - internet    1   310.34 388.34
## - famsup      1   310.35 388.35
## - higher      1   310.47 388.47
## - G3          1   310.57 388.57
## - Pstatus     1   310.59 388.59
## - G2          1   310.70 388.70
## - Fedu        1   310.83 388.83
## - Fjob        4   317.06 389.06
## - school      1   311.12 389.12
## - G1          1   311.40 389.40
## - failures    1   311.73 389.73
## - nursery     1   311.99 389.99
## - studytime   1   312.08 390.08
## - romantic    1   312.11 390.11
## - traveltime  1   312.14 390.14
## <none>            310.30 390.30
## - famsize     1   312.70 390.70
## - paid        1   313.03 391.03
## - health      1   313.05 391.05
## - freetime    1   313.44 391.44
## - guardian    2   315.67 391.67
## - activities  1   313.75 391.75
## - address     1   314.45 392.45
## - reason      3   319.54 393.54
## - sex         1   316.96 394.96
## - absences    1   321.59 399.59
## - famrel      1   322.20 400.20
## - goout       1   350.56 428.56
## 
## Step:  AIC=386.58
## high_use ~ G3 + absences + school + sex + address + famsize + 
##     Pstatus + Medu + Fedu + Fjob + reason + guardian + traveltime + 
##     studytime + schoolsup + famsup + activities + nursery + higher + 
##     internet + romantic + famrel + freetime + goout + health + 
##     failures + paid + G1 + G2
## 
##              Df Deviance    AIC
## - Fjob        4   320.37 384.37
## - Medu        1   314.58 384.58
## - schoolsup   1   314.59 384.59
## - internet    1   314.63 384.63
## - Pstatus     1   314.67 384.67
## - famsup      1   314.72 384.72
## - higher      1   314.82 384.82
## - G2          1   314.84 384.84
## - G3          1   314.92 384.92
## - school      1   315.37 385.37
## - G1          1   315.71 385.71
## - studytime   1   315.74 385.74
## - Fedu        1   315.77 385.77
## - failures    1   316.08 386.08
## - romantic    1   316.33 386.33
## - health      1   316.39 386.39
## - nursery     1   316.44 386.44
## - traveltime  1   316.50 386.50
## <none>            314.58 386.58
## - guardian    2   318.64 386.64
## - famsize     1   317.00 387.00
## - freetime    1   317.40 387.40
## - activities  1   317.68 387.68
## - paid        1   318.01 388.01
## - reason      3   322.94 388.94
## - address     1   319.29 389.29
## - sex         1   321.42 391.42
## - famrel      1   326.22 396.22
## - absences    1   326.30 396.30
## - goout       1   353.06 423.06
## 
## Step:  AIC=384.37
## high_use ~ G3 + absences + school + sex + address + famsize + 
##     Pstatus + Medu + Fedu + reason + guardian + traveltime + 
##     studytime + schoolsup + famsup + activities + nursery + higher + 
##     internet + romantic + famrel + freetime + goout + health + 
##     failures + paid + G1 + G2
## 
##              Df Deviance    AIC
## - Pstatus     1   320.39 382.39
## - internet    1   320.40 382.40
## - schoolsup   1   320.41 382.41
## - Medu        1   320.44 382.44
## - higher      1   320.51 382.51
## - G3          1   320.61 382.61
## - famsup      1   320.72 382.72
## - G2          1   320.84 382.84
## - Fedu        1   320.90 382.90
## - school      1   321.08 383.08
## - studytime   1   321.52 383.52
## - traveltime  1   321.79 383.79
## - failures    1   321.83 383.83
## - nursery     1   322.04 384.04
## - G1          1   322.23 384.23
## - romantic    1   322.23 384.23
## <none>            320.37 384.37
## - health      1   322.38 384.38
## - freetime    1   322.53 384.53
## - famsize     1   322.93 384.93
## - activities  1   323.16 385.16
## - guardian    2   325.32 385.32
## - reason      3   327.69 385.69
## - paid        1   324.81 386.81
## - address     1   325.55 387.55
## - sex         1   327.22 389.22
## - famrel      1   330.47 392.47
## - absences    1   332.72 394.72
## - goout       1   360.29 422.29
## 
## Step:  AIC=382.39
## high_use ~ G3 + absences + school + sex + address + famsize + 
##     Medu + Fedu + reason + guardian + traveltime + studytime + 
##     schoolsup + famsup + activities + nursery + higher + internet + 
##     romantic + famrel + freetime + goout + health + failures + 
##     paid + G1 + G2
## 
##              Df Deviance    AIC
## - internet    1   320.42 380.42
## - schoolsup   1   320.43 380.43
## - Medu        1   320.45 380.45
## - higher      1   320.53 380.53
## - G3          1   320.65 380.65
## - famsup      1   320.75 380.75
## - G2          1   320.85 380.85
## - Fedu        1   320.93 380.93
## - school      1   321.12 381.12
## - studytime   1   321.53 381.53
## - traveltime  1   321.81 381.81
## - failures    1   321.86 381.86
## - nursery     1   322.05 382.05
## - romantic    1   322.25 382.25
## - G1          1   322.29 382.29
## - health      1   322.39 382.39
## <none>            320.39 382.39
## - freetime    1   322.53 382.53
## - famsize     1   323.01 383.01
## - activities  1   323.26 383.26
## - guardian    2   325.33 383.33
## - reason      3   327.71 383.71
## - paid        1   324.81 384.81
## - address     1   325.57 385.57
## - sex         1   327.24 387.24
## - famrel      1   330.51 390.51
## - absences    1   332.91 392.91
## - goout       1   360.30 420.30
## 
## Step:  AIC=380.42
## high_use ~ G3 + absences + school + sex + address + famsize + 
##     Medu + Fedu + reason + guardian + traveltime + studytime + 
##     schoolsup + famsup + activities + nursery + higher + romantic + 
##     famrel + freetime + goout + health + failures + paid + G1 + 
##     G2
## 
##              Df Deviance    AIC
## - schoolsup   1   320.45 378.45
## - Medu        1   320.47 378.47
## - higher      1   320.57 378.57
## - G3          1   320.67 378.67
## - famsup      1   320.77 378.77
## - G2          1   320.87 378.87
## - Fedu        1   320.95 378.95
## - school      1   321.13 379.13
## - studytime   1   321.55 379.55
## - traveltime  1   321.83 379.83
## - failures    1   321.87 379.87
## - nursery     1   322.13 380.13
## - romantic    1   322.25 380.25
## - G1          1   322.30 380.30
## <none>            320.42 380.42
## - health      1   322.42 380.42
## - freetime    1   322.59 380.59
## - famsize     1   323.06 381.06
## - activities  1   323.26 381.26
## - guardian    2   325.37 381.37
## - reason      3   327.73 381.73
## - paid        1   324.93 382.93
## - address     1   325.67 383.67
## - sex         1   327.31 385.31
## - famrel      1   330.52 388.52
## - absences    1   333.27 391.27
## - goout       1   360.46 418.46
## 
## Step:  AIC=378.45
## high_use ~ G3 + absences + school + sex + address + famsize + 
##     Medu + Fedu + reason + guardian + traveltime + studytime + 
##     famsup + activities + nursery + higher + romantic + famrel + 
##     freetime + goout + health + failures + paid + G1 + G2
## 
##              Df Deviance    AIC
## - Medu        1   320.50 376.50
## - higher      1   320.61 376.61
## - G3          1   320.69 376.69
## - famsup      1   320.80 376.80
## - G2          1   320.89 376.89
## - Fedu        1   320.97 376.97
## - school      1   321.13 377.13
## - studytime   1   321.60 377.60
## - traveltime  1   321.84 377.84
## - failures    1   321.90 377.90
## - nursery     1   322.16 378.16
## - romantic    1   322.27 378.27
## - G1          1   322.32 378.32
## <none>            320.45 378.45
## - health      1   322.49 378.49
## - freetime    1   322.61 378.61
## - famsize     1   323.10 379.10
## - activities  1   323.35 379.35
## - guardian    2   325.38 379.38
## - reason      3   327.75 379.75
## - paid        1   325.06 381.06
## - address     1   325.79 381.79
## - sex         1   327.67 383.67
## - famrel      1   330.54 386.54
## - absences    1   333.37 389.37
## - goout       1   360.66 416.66
## 
## Step:  AIC=376.5
## high_use ~ G3 + absences + school + sex + address + famsize + 
##     Fedu + reason + guardian + traveltime + studytime + famsup + 
##     activities + nursery + higher + romantic + famrel + freetime + 
##     goout + health + failures + paid + G1 + G2
## 
##              Df Deviance    AIC
## - higher      1   320.66 374.66
## - G3          1   320.75 374.75
## - famsup      1   320.87 374.87
## - G2          1   320.93 374.93
## - Fedu        1   321.03 375.03
## - school      1   321.18 375.18
## - studytime   1   321.65 375.65
## - traveltime  1   321.91 375.91
## - failures    1   322.01 376.01
## - nursery     1   322.28 376.28
## - romantic    1   322.35 376.35
## - G1          1   322.37 376.37
## <none>            320.50 376.50
## - health      1   322.54 376.54
## - freetime    1   322.65 376.65
## - famsize     1   323.18 377.18
## - activities  1   323.42 377.42
## - guardian    2   325.60 377.60
## - reason      3   327.81 377.81
## - paid        1   325.07 379.07
## - address     1   325.93 379.93
## - sex         1   327.67 381.67
## - famrel      1   330.59 384.59
## - absences    1   333.40 387.40
## - goout       1   360.66 414.66
## 
## Step:  AIC=374.66
## high_use ~ G3 + absences + school + sex + address + famsize + 
##     Fedu + reason + guardian + traveltime + studytime + famsup + 
##     activities + nursery + romantic + famrel + freetime + goout + 
##     health + failures + paid + G1 + G2
## 
##              Df Deviance    AIC
## - G3          1   320.88 372.88
## - famsup      1   321.06 373.06
## - G2          1   321.10 373.10
## - Fedu        1   321.14 373.14
## - school      1   321.38 373.38
## - studytime   1   321.86 373.86
## - traveltime  1   322.07 374.07
## - failures    1   322.33 374.33
## - romantic    1   322.38 374.38
## - nursery     1   322.40 374.40
## - G1          1   322.55 374.55
## <none>            320.66 374.66
## - health      1   322.67 374.67
## - freetime    1   322.79 374.79
## - famsize     1   323.32 375.32
## - activities  1   323.79 375.79
## - guardian    2   325.86 375.86
## - reason      3   327.93 375.93
## - paid        1   325.11 377.11
## - address     1   326.17 378.17
## - sex         1   328.20 380.20
## - famrel      1   330.73 382.73
## - absences    1   333.75 385.75
## - goout       1   360.88 412.88
## 
## Step:  AIC=372.88
## high_use ~ absences + school + sex + address + famsize + Fedu + 
##     reason + guardian + traveltime + studytime + famsup + activities + 
##     nursery + romantic + famrel + freetime + goout + health + 
##     failures + paid + G1 + G2
## 
##              Df Deviance    AIC
## - famsup      1   321.23 371.23
## - Fedu        1   321.32 371.32
## - school      1   321.66 371.66
## - studytime   1   322.19 372.19
## - traveltime  1   322.34 372.34
## - failures    1   322.42 372.42
## - G2          1   322.54 372.54
## - romantic    1   322.56 372.56
## - G1          1   322.60 372.60
## - nursery     1   322.63 372.63
## <none>            320.88 372.88
## - health      1   322.91 372.91
## - freetime    1   323.01 373.01
## - famsize     1   323.45 373.45
## - guardian    2   325.96 373.96
## - activities  1   323.98 373.98
## - reason      3   327.99 373.99
## - paid        1   325.33 375.33
## - address     1   326.35 376.35
## - sex         1   328.32 378.32
## - famrel      1   330.73 380.73
## - absences    1   334.31 384.31
## - goout       1   361.17 411.17
## 
## Step:  AIC=371.23
## high_use ~ absences + school + sex + address + famsize + Fedu + 
##     reason + guardian + traveltime + studytime + activities + 
##     nursery + romantic + famrel + freetime + goout + health + 
##     failures + paid + G1 + G2
## 
##              Df Deviance    AIC
## - Fedu        1   321.57 369.57
## - school      1   321.87 369.87
## - traveltime  1   322.66 370.66
## - studytime   1   322.74 370.74
## - failures    1   322.78 370.78
## - G2          1   322.87 370.87
## - G1          1   322.87 370.87
## - nursery     1   322.96 370.96
## - romantic    1   322.98 370.98
## - health      1   323.12 371.12
## <none>            321.23 371.23
## - freetime    1   323.23 371.23
## - famsize     1   323.87 371.87
## - guardian    2   326.22 372.22
## - activities  1   324.31 372.31
## - reason      3   328.42 372.42
## - paid        1   325.35 373.35
## - address     1   326.60 374.60
## - sex         1   329.13 377.13
## - famrel      1   330.86 378.86
## - absences    1   334.61 382.61
## - goout       1   361.79 409.79
## 
## Step:  AIC=369.57
## high_use ~ absences + school + sex + address + famsize + reason + 
##     guardian + traveltime + studytime + activities + nursery + 
##     romantic + famrel + freetime + goout + health + failures + 
##     paid + G1 + G2
## 
##              Df Deviance    AIC
## - school      1   322.18 368.18
## - traveltime  1   322.86 368.86
## - failures    1   322.87 368.87
## - nursery     1   323.07 369.07
## - G1          1   323.12 369.12
## - studytime   1   323.23 369.23
## - G2          1   323.23 369.23
## - romantic    1   323.28 369.28
## - freetime    1   323.44 369.44
## <none>            321.57 369.57
## - health      1   323.58 369.58
## - famsize     1   324.02 370.02
## - activities  1   324.48 370.48
## - reason      3   328.70 370.70
## - guardian    2   327.05 371.05
## - paid        1   325.82 371.82
## - address     1   326.98 372.98
## - sex         1   329.62 375.62
## - famrel      1   331.29 377.29
## - absences    1   335.46 381.46
## - goout       1   363.61 409.61
## 
## Step:  AIC=368.18
## high_use ~ absences + sex + address + famsize + reason + guardian + 
##     traveltime + studytime + activities + nursery + romantic + 
##     famrel + freetime + goout + health + failures + paid + G1 + 
##     G2
## 
##              Df Deviance    AIC
## - traveltime  1   323.22 367.22
## - G1          1   323.54 367.54
## - nursery     1   323.57 367.57
## - failures    1   323.64 367.64
## - G2          1   323.71 367.71
## - studytime   1   323.76 367.76
## - freetime    1   323.86 367.86
## - romantic    1   324.04 368.04
## <none>            322.18 368.18
## - health      1   324.53 368.53
## - famsize     1   324.55 368.55
## - reason      3   328.79 368.79
## - activities  1   324.93 368.93
## - guardian    2   327.39 369.39
## - paid        1   326.30 370.30
## - address     1   326.99 370.99
## - sex         1   330.30 374.30
## - famrel      1   331.74 375.74
## - absences    1   336.59 380.59
## - goout       1   364.02 408.02
## 
## Step:  AIC=367.22
## high_use ~ absences + sex + address + famsize + reason + guardian + 
##     studytime + activities + nursery + romantic + famrel + freetime + 
##     goout + health + failures + paid + G1 + G2
## 
##              Df Deviance    AIC
## - G1          1   324.57 366.57
## - G2          1   324.63 366.63
## - nursery     1   324.69 366.69
## - freetime    1   324.75 366.75
## - failures    1   324.81 366.81
## - studytime   1   324.99 366.99
## - romantic    1   325.10 367.10
## <none>            323.22 367.22
## - reason      3   329.53 367.53
## - health      1   325.57 367.57
## - famsize     1   325.97 367.97
## - activities  1   326.01 368.01
## - guardian    2   329.24 369.24
## - paid        1   327.40 369.40
## - address     1   330.64 372.64
## - sex         1   331.41 373.41
## - famrel      1   332.55 374.55
## - absences    1   337.62 379.62
## - goout       1   366.10 408.10
## 
## Step:  AIC=366.57
## high_use ~ absences + sex + address + famsize + reason + guardian + 
##     studytime + activities + nursery + romantic + famrel + freetime + 
##     goout + health + failures + paid + G2
## 
##              Df Deviance    AIC
## - G2          1   324.70 364.70
## - freetime    1   325.93 365.93
## - nursery     1   326.11 366.11
## - failures    1   326.24 366.24
## - studytime   1   326.54 366.54
## <none>            324.57 366.57
## - health      1   327.05 367.05
## - romantic    1   327.05 367.05
## - famsize     1   327.24 367.24
## - reason      3   331.26 367.26
## - activities  1   327.40 367.40
## - guardian    2   330.30 368.30
## - paid        1   329.01 369.01
## - address     1   331.70 371.70
## - sex         1   332.66 372.66
## - famrel      1   334.08 374.08
## - absences    1   338.85 378.85
## - goout       1   366.53 406.53
## 
## Step:  AIC=364.7
## high_use ~ absences + sex + address + famsize + reason + guardian + 
##     studytime + activities + nursery + romantic + famrel + freetime + 
##     goout + health + failures + paid
## 
##              Df Deviance    AIC
## - freetime    1   326.08 364.08
## - nursery     1   326.21 364.21
## - failures    1   326.24 364.24
## - studytime   1   326.57 364.57
## <none>            324.70 364.70
## - health      1   327.09 365.09
## - romantic    1   327.32 365.32
## - activities  1   327.43 365.43
## - reason      3   331.46 365.46
## - famsize     1   327.50 365.50
## - guardian    2   330.37 366.37
## - paid        1   329.18 367.18
## - address     1   331.74 369.74
## - sex         1   332.86 370.86
## - famrel      1   334.27 372.27
## - absences    1   338.85 376.85
## - goout       1   367.17 405.17
## 
## Step:  AIC=364.08
## high_use ~ absences + sex + address + famsize + reason + guardian + 
##     studytime + activities + nursery + romantic + famrel + goout + 
##     health + failures + paid
## 
##              Df Deviance    AIC
## - nursery     1   327.51 363.51
## - failures    1   327.64 363.64
## - studytime   1   328.04 364.04
## <none>            326.08 364.08
## - health      1   328.44 364.44
## - romantic    1   328.46 364.46
## - activities  1   328.54 364.54
## - reason      3   332.61 364.61
## - famsize     1   328.73 364.73
## - guardian    2   332.13 366.13
## - paid        1   330.49 366.49
## - address     1   332.70 368.70
## - famrel      1   335.05 371.05
## - sex         1   335.40 371.40
## - absences    1   339.96 375.96
## - goout       1   375.82 411.82
## 
## Step:  AIC=363.51
## high_use ~ absences + sex + address + famsize + reason + guardian + 
##     studytime + activities + romantic + famrel + goout + health + 
##     failures + paid
## 
##              Df Deviance    AIC
## - failures    1   329.01 363.01
## <none>            327.51 363.51
## - famsize     1   329.70 363.70
## - romantic    1   329.86 363.86
## - studytime   1   329.87 363.87
## - activities  1   329.88 363.88
## - reason      3   333.89 363.89
## - health      1   329.95 363.95
## - paid        1   331.64 365.64
## - guardian    2   334.22 366.22
## - address     1   334.73 368.73
## - famrel      1   336.36 370.36
## - sex         1   336.42 370.42
## - absences    1   340.99 374.99
## - goout       1   376.85 410.85
## 
## Step:  AIC=363.01
## high_use ~ absences + sex + address + famsize + reason + guardian + 
##     studytime + activities + romantic + famrel + goout + health + 
##     paid
## 
##              Df Deviance    AIC
## <none>            329.01 363.01
## - famsize     1   331.20 363.20
## - romantic    1   331.25 363.25
## - reason      3   335.60 363.60
## - activities  1   331.61 363.61
## - health      1   331.97 363.97
## - studytime   1   332.12 364.12
## - paid        1   332.59 364.59
## - guardian    2   335.95 365.95
## - address     1   336.70 368.70
## - sex         1   338.10 370.10
## - famrel      1   338.76 370.76
## - absences    1   343.00 375.00
## - goout       1   382.34 414.34
## 
## Call:  glm(formula = high_use ~ absences + sex + address + famsize + 
##     reason + guardian + studytime + activities + romantic + famrel + 
##     goout + health + paid, family = "binomial", data = alc)
## 
## Coefficients:
##      (Intercept)          absences              sexM          addressU  
##         -1.72681           0.09174           0.91034          -0.94771  
##       famsizeLE3        reasonhome       reasonother  reasonreputation  
##          0.45183           0.20790           1.02210          -0.27714  
##   guardianmother     guardianother         studytime     activitiesyes  
##         -0.75088           0.40855          -0.32528          -0.46361  
##      romanticyes            famrel             goout            health  
##         -0.45457          -0.48800           0.90858           0.17681  
##          paidyes  
##          0.54838  
## 
## Degrees of Freedom: 369 Total (i.e. Null);  353 Residual
## Null Deviance:       452 
## Residual Deviance: 329   AIC: 363
m3 <- glm(high_use ~ absences + sex + address + famsize + 
    reason + guardian + studytime + activities + romantic + famrel + 
    goout + health + paid, data = alc, family = "binomial")
summary(m3)
## 
## Call:
## glm(formula = high_use ~ absences + sex + address + famsize + 
##     reason + guardian + studytime + activities + romantic + famrel + 
##     goout + health + paid, family = "binomial", data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8674  -0.7171  -0.4029   0.5821   2.7665  
## 
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      -1.72681    0.94633  -1.825 0.068041 .  
## absences          0.09174    0.02432   3.773 0.000161 ***
## sexM              0.91034    0.30568   2.978 0.002901 ** 
## addressU         -0.94771    0.34320  -2.761 0.005756 ** 
## famsizeLE3        0.45183    0.30426   1.485 0.137536    
## reasonhome        0.20790    0.36585   0.568 0.569858    
## reasonother       1.02210    0.48846   2.093 0.036393 *  
## reasonreputation -0.27714    0.37752  -0.734 0.462877    
## guardianmother   -0.75088    0.33290  -2.256 0.024096 *  
## guardianother     0.40855    0.73349   0.557 0.577533    
## studytime        -0.32528    0.18830  -1.727 0.084081 .  
## activitiesyes    -0.46361    0.28927  -1.603 0.109002    
## romanticyes      -0.45457    0.30736  -1.479 0.139160    
## famrel           -0.48800    0.15813  -3.086 0.002028 ** 
## goout             0.90858    0.13835   6.567 5.13e-11 ***
## health            0.17681    0.10399   1.700 0.089096 .  
## paidyes           0.54838    0.29252   1.875 0.060834 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 452.04  on 369  degrees of freedom
## Residual deviance: 329.01  on 353  degrees of freedom
## AIC: 363.01
## 
## Number of Fisher Scoring iterations: 5
alc2 = mutate(alc, probability = predict(m3, type = "response"))
alc = mutate(alc, prediction =probability>0.5)
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}
loss_func(alc$high_use,alc$probability)
## [1] 0.2756757
cv <- cv.glm(data = alc, cost = loss_func, glmfit = m, K = 10)
cv$delta[1]
## [1] 0.2837838

I tried to create as good model as possible by introducing all variable to a model and used step function to find model with smallest AIC ending up with model m3. How ever when calculating the error, it was 0.2783. Just slightly smaller than on the original model. In addition, i am coding this late at night as the deadline is looming, so i must just give up and declare that i cannot find better model than on exercise 3.


Chapter 4: Clustering and classification

date()
## [1] "Mon Dec 11 09:42:00 2023"

Data

Let’s import data. We will use “Boston” data from MASS package.

library(MASS)
## Warning: package 'MASS' was built under R version 4.2.3
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
library(ggplot2)
data("Boston")

Now we can explore how the data looks

str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
dim(Boston)
## [1] 506  14

Dataset “Boston” has 506 observations and 14 variables. Most of the varables are numeric, but two are integrals. Variables of the data and their diecription are presented bellow.

Variable Descripition
crim per capita crime rate by town.
zn proportion of residential land zoned for lots over 25,000 sq.ft
indus proportion of non-retail business acres per town.
chas Charles River dummy variable
nox nitrogen oxides concentration
rm average number of rooms per dwelling.
age proportion of owner-occupied units built prior to 1940.
dis weighted mean of distances to five Boston employment centres.
rad index of accessibility to radial highways.
tax full-value property-tax rate per $10,000.
ptratio pupil-teacher ratio by town.
black proportion of blacks by town
lstat lower status of the population
medv median value of owner-occupied homes in $1000s.

Summary of the data

##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08205   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00

Here you can see the pair-wise comparisation of each variable. Matrix is fairly small in size but provides some visual insigts.

One can see that some variables have little relationships (such as age and ptratio) whereas some have strong relationship (nox and dis). Correlation matrix could be easier to interpret.

## Warning: package 'corrplot' was built under R version 4.2.3
## corrplot 0.92 loaded

##Scaling the data

boston_scaled= scale(Boston)
boston_scaled=as.data.frame(boston_scaled)
summary(boston_scaled)
##       crim                 zn               indus              chas        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563   Min.   :-0.2723  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668   1st Qu.:-0.2723  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109   Median :-0.2723  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150   3rd Qu.:-0.2723  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202   Max.   : 3.6648  
##       nox                rm               age               dis         
##  Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331   Min.   :-1.2658  
##  1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366   1st Qu.:-0.8049  
##  Median :-0.1441   Median :-0.1084   Median : 0.3171   Median :-0.2790  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059   3rd Qu.: 0.6617  
##  Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164   Max.   : 3.9566  
##       rad               tax             ptratio            black        
##  Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047   Min.   :-3.9033  
##  1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876   1st Qu.: 0.2049  
##  Median :-0.5225   Median :-0.4642   Median : 0.2746   Median : 0.3808  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058   3rd Qu.: 0.4332  
##  Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372   Max.   : 0.4406  
##      lstat              medv        
##  Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 3.5453   Max.   : 2.9865

After scaling the mean of each variable is 0. So instead of absolute values the value suggests how far from the mean within that variable the observation is.

bins <- quantile(boston_scaled$crim)
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, labels = c("low", "med_low", "med_high", "high"))
boston_scaled <- dplyr::select(boston_scaled, -crim)
boston_scaled <- data.frame(boston_scaled, crime)

Divide data

n <- nrow(boston_scaled)
ind <- sample(n,  size = n * 0.8)
train <- boston_scaled[ind,]
test <- boston_scaled[-ind,]

Linear discriminant analysis

lda.fit <- lda(crime ~ ., data = train)

lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  graphics::arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}

# target classes as numeric
classes <- as.numeric(train$crime)

# plot the lda results (select both lines and execute them at the same time!)
plot(lda.fit, dimen = 2, col=classes, pch=classes)
lda.arrows(lda.fit, myscale = 2)

Here we can find see what variables separates the data into different groups. It seems that rad has the greatest effect on LD1 where as nox has the greatest effect on LD2. If i interpret the graph correctly, access to high ways is linked to high crime rate and nitrogen oxides to medium high rates.

correct_classes <- test$crime
test <- dplyr::select(test, -crime)
lda.pred <- predict(lda.fit, newdata = test)
table(correct = correct_classes, predicted = lda.pred$class)
##           predicted
## correct    low med_low med_high high
##   low       16       9        1    0
##   med_low    6      17        5    0
##   med_high   1      15       11    0
##   high       0       0        1   20

In the most cases the model made correct predictions. Especially success rate on predicting high crime rate was good. Predicting medium low crimes seems to be the most difficult. How ever the predictions are not perfect: for example in one case the model predicted medium high crime rate even tough actual rate was low.

data(Boston)
scal_bos=as.data.frame(scale(Boston))
dist_eu <- dist(scal_bos)
summary(dist_eu)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1343  3.4625  4.8241  4.9111  6.1863 14.3970

##K-means

Let’s calculate k-means

km <- kmeans(scal_bos, centers = 5)
summary(km)
##              Length Class  Mode   
## cluster      506    -none- numeric
## centers       70    -none- numeric
## totss          1    -none- numeric
## withinss       5    -none- numeric
## tot.withinss   1    -none- numeric
## betweenss      1    -none- numeric
## size           5    -none- numeric
## iter           1    -none- numeric
## ifault         1    -none- numeric

How ever the means calculated above would be better if we would fing out the optimal number of clusetrs. So let us do that!

k_max <- 10
twcss <- sapply(1:k_max, function(k){kmeans(scal_bos, k)$tot.withinss})
qplot(x = 1:k_max, y = twcss, geom = 'line')
## Warning: `qplot()` was deprecated in ggplot2 3.4.0.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

km <- kmeans(scal_bos, centers = 2)

pairs(scal_bos, col = km$cluster)

It is hard to see what is going on, so let’s take a closer look on some of interesting looking variables

pairs(scal_bos[c(1,10)], col = km$cluster)

pairs(scal_bos[c(1,7)], col = km$cluster)

For example on these two plots one can see clear clustering with crime and tax. High tax and high crime rate seems to create two different clusters. This suggests that there are less crime in areas with high tax.

It also seems that crime rate increases in with owner-occupied units built prior to 1940.


Bonus!

km <- kmeans(scal_bos, centers = 4)
lda.fit <- lda(km$cluster ~ ., data = scal_bos)

lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  graphics::arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}

# plot the lda results (select both lines and execute them at the same time!)
plot(lda.fit, dimen = 2, col=classes, pch=classes)
lda.arrows(lda.fit, myscale = 2.5)

The most influential variable is black on LD2. On LD1 it is hard to determine, but it might be nox.

Super bonus!

library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:MASS':
## 
##     select
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
model_predictors <- dplyr::select(train, -crime)
lda.fit <- lda(crime ~ ., data = train)
# check the dimensions
dim(model_predictors)
## [1] 404  13
dim(lda.fit$scaling)
## [1] 13  3
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)


plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color= train$crime)

Assigment 5: Dimensionality reduction techniques

Data and data manipulation

Let’s import the data!

human=read.csv("C:/Users/03196349/Work Folders/PhD opinnot/IODS/IODS-project/data/human.csv")
human=human[,-1]

Next we move country names to column names and take a look at the data

library(tibble)
## Warning: package 'tibble' was built under R version 4.2.3
library(GGally)
library(dplyr)
human=column_to_rownames(human, "Country")
ggpairs(human)

summary(human)
##     Edu2.FM          Labo.FM          Edu.Exp         Life.Exp    
##  Min.   :0.1717   Min.   :0.1857   Min.   : 5.40   Min.   :49.00  
##  1st Qu.:0.7264   1st Qu.:0.5984   1st Qu.:11.25   1st Qu.:66.30  
##  Median :0.9375   Median :0.7535   Median :13.50   Median :74.20  
##  Mean   :0.8529   Mean   :0.7074   Mean   :13.18   Mean   :71.65  
##  3rd Qu.:0.9968   3rd Qu.:0.8535   3rd Qu.:15.20   3rd Qu.:77.25  
##  Max.   :1.4967   Max.   :1.0380   Max.   :20.20   Max.   :83.50  
##       GNI            Mat.Mor         Ado.Birth         Parli.F     
##  Min.   :   581   Min.   :   1.0   Min.   :  0.60   Min.   : 0.00  
##  1st Qu.:  4198   1st Qu.:  11.5   1st Qu.: 12.65   1st Qu.:12.40  
##  Median : 12040   Median :  49.0   Median : 33.60   Median :19.30  
##  Mean   : 17628   Mean   : 149.1   Mean   : 47.16   Mean   :20.91  
##  3rd Qu.: 24512   3rd Qu.: 190.0   3rd Qu.: 71.95   3rd Qu.:27.95  
##  Max.   :123124   Max.   :1100.0   Max.   :204.80   Max.   :57.50

May of the variables are correlated with each other. i.e. life expectancy and education. How ever some variables have little correlation with each other such as labour force participation and secondary education.
***

PCA

pca_human= prcomp(human)
biplot(pca_human, choices = 1:2, cex=c(1, 1), col=c("grey40", "red"))
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

Without standardisation differnet scales makes graph impossible to read. Maximum value of GNI is so large, it dwarfs all other variables. Such we must standardize the variables.

human_std= scale(human)
pca_human_std= prcomp(human_std)
biplot(pca_human_std, choices = 1:2, cex=c(1, 1), col=c("gray40", "red"))

After standardization the graph is easier to interpret. Labo.FM has greatest effect on PC2. It is harder to say witch variable has greatest effect on PC1, perhaps edu.exp.
There are 3 separate groups. labo.fm and paril.F create the first group. They correlated but they are not correlated to any other variable. Variables Mat.Mor and Ado.Birth are correlated and create group 2. Group 3 is composed of the rest of variables. Variables in group 2 and 3 are negative correlated. This means that in countries with high education and high income decreases Maternal Mortality Ratio and Adolescent Birth Rate. In other words, those decrease as welfare increases. Group 1 suggests that Labor Force Participation Rate and female percent Representation in Parliament has little to do with income and education. Thus PC1 represents welfare (or as high welfare is in negative it represents the hardship). And PC2 represents variables that don’t have correlation with that. PC2 could be more about equality. High % of labor suggests all genders are actively working outside home and high precent of females in parliament suggests that they females are active operators in society.


MCA

tea = read.csv("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/tea.csv", stringsAsFactors = TRUE)
str(tea)
## 'data.frame':    300 obs. of  36 variables:
##  $ breakfast       : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
##  $ tea.time        : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
##  $ evening         : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
##  $ lunch           : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dinner          : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
##  $ always          : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
##  $ home            : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
##  $ work            : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
##  $ tearoom         : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
##  $ friends         : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
##  $ resto           : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
##  $ pub             : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Tea             : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
##  $ How             : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
##  $ sugar           : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
##  $ how             : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ where           : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ price           : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
##  $ age             : int  39 45 47 23 48 21 37 36 40 37 ...
##  $ sex             : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
##  $ SPC             : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
##  $ Sport           : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
##  $ age_Q           : Factor w/ 5 levels "+60","15-24",..: 4 5 5 2 5 2 4 4 4 4 ...
##  $ frequency       : Factor w/ 4 levels "+2/day","1 to 2/week",..: 3 3 1 3 1 3 4 2 1 1 ...
##  $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
##  $ spirituality    : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
##  $ healthy         : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
##  $ diuretic        : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
##  $ friendliness    : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
##  $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ feminine        : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
##  $ sophisticated   : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
##  $ slimming        : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ exciting        : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
##  $ relaxing        : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
##  $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
dim(tea)
## [1] 300  36
View(tea)

Lets see of there are NA values

unique(is.na(tea))
##      breakfast tea.time evening lunch dinner always  home  work tearoom friends
## [1,]     FALSE    FALSE   FALSE FALSE  FALSE  FALSE FALSE FALSE   FALSE   FALSE
##      resto   pub   Tea   How sugar   how where price   age   sex   SPC Sport
## [1,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##      age_Q frequency escape.exoticism spirituality healthy diuretic
## [1,] FALSE     FALSE            FALSE        FALSE   FALSE    FALSE
##      friendliness iron.absorption feminine sophisticated slimming exciting
## [1,]        FALSE           FALSE    FALSE         FALSE    FALSE    FALSE
##      relaxing effect.on.health
## [1,]    FALSE            FALSE

no Na values

library(FactoMineR)
## Warning: package 'FactoMineR' was built under R version 4.2.3
keep=c("home", "work", "tearoom", "friends", "resto", "pub")
tea_loc_sex= select(tea, one_of(keep))
mca <- MCA(tea_loc_sex, graph = FALSE)
plot(mca, invisible=c("ind"), graph.type = "classic", cex=0.8)

I chose variables that represent where tea is consumed. It seems that people taht consume tea, consume it everywhere( work, restaurants, tearooms, pubs etc.). How ever it is interesting that “not home” variable has little correlation to other “not” answers. It seems that dimension 1 is more about “do you drink tea”, but not home is clearly an outlier, as all other not-answers are bellow zero on dimension 1. Lets try other ploting options!

plot(mca, invisible=c("ind"), graph.type = "ggplot", cex=0.8)
## Warning: ggrepel: 4 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps

Using ggplot option, the graph is a lot clearer as the texts do not overlap as much.

That’s all folks!

Assigment 6: Analysis of lognitudial data

Data

Let’s import data sets required for the assignment.

BPRSL=read.table("C:/Users/03196349/Work Folders/PhD opinnot/IODS/IODS-project/data/BPRSL.csv", header = TRUE, sep = ",")
BPRSL=BPRSL[,-1]
BPRSL$treatment=factor(BPRSL$treatment)
BPRSL$subject=factor(BPRSL$subject)
RATSL=read.table("C:/Users/03196349/Work Folders/PhD opinnot/IODS/IODS-project/data/RATSL.csv", header = TRUE, sep = ",")
RATSL=RATSL[,-c(1,4)]
RATSL$Group=factor(RATSL$Group)
RATSL$ID=factor(RATSL$ID)

Analysis of Longitudinal Data I: Graphical Displays and Summary Measure Approach

First we fetch some packages

The data is already in log form. It is not as easy to interpret by eye, but R prefers that format. How ever we can have a look on the data (or part of it)

knitr::kable(glimpse(RATSL))
## Rows: 176
## Columns: 4
## $ ID     <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1, 2, 3,…
## $ Group  <fct> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 1, 1, 1, 1, 1, …
## $ Weight <int> 240, 225, 245, 260, 255, 260, 275, 245, 410, 405, 445, 555, 470…
## $ Time   <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 8, 8, 8, 8, 8, …
ID Group Weight Time
1 1 240 1
2 1 225 1
3 1 245 1
4 1 260 1
5 1 255 1
6 1 260 1
7 1 275 1
8 1 245 1
9 2 410 1
10 2 405 1
11 2 445 1
12 2 555 1
13 3 470 1
14 3 535 1
15 3 520 1
16 3 510 1
1 1 250 8
2 1 230 8
3 1 250 8
4 1 255 8
5 1 260 8
6 1 265 8
7 1 275 8
8 1 255 8
9 2 415 8
10 2 420 8
11 2 445 8
12 2 560 8
13 3 465 8
14 3 525 8
15 3 525 8
16 3 510 8
1 1 255 15
2 1 230 15
3 1 250 15
4 1 255 15
5 1 255 15
6 1 270 15
7 1 260 15
8 1 260 15
9 2 425 15
10 2 430 15
11 2 450 15
12 2 565 15
13 3 475 15
14 3 530 15
15 3 530 15
16 3 520 15
1 1 260 22
2 1 232 22
3 1 255 22
4 1 265 22
5 1 270 22
6 1 275 22
7 1 270 22
8 1 268 22
9 2 428 22
10 2 440 22
11 2 452 22
12 2 580 22
13 3 485 22
14 3 533 22
15 3 540 22
16 3 515 22
1 1 262 29
2 1 240 29
3 1 262 29
4 1 265 29
5 1 270 29
6 1 275 29
7 1 273 29
8 1 270 29
9 2 438 29
10 2 448 29
11 2 455 29
12 2 590 29
13 3 487 29
14 3 535 29
15 3 543 29
16 3 530 29
1 1 258 36
2 1 240 36
3 1 265 36
4 1 268 36
5 1 273 36
6 1 277 36
7 1 274 36
8 1 265 36
9 2 443 36
10 2 460 36
11 2 455 36
12 2 597 36
13 3 493 36
14 3 540 36
15 3 546 36
16 3 538 36
1 1 266 43
2 1 243 43
3 1 267 43
4 1 270 43
5 1 274 43
6 1 278 43
7 1 276 43
8 1 265 43
9 2 442 43
10 2 458 43
11 2 451 43
12 2 595 43
13 3 493 43
14 3 525 43
15 3 538 43
16 3 535 43
1 1 266 44
2 1 244 44
3 1 267 44
4 1 272 44
5 1 273 44
6 1 278 44
7 1 271 44
8 1 267 44
9 2 446 44
10 2 464 44
11 2 450 44
12 2 595 44
13 3 504 44
14 3 530 44
15 3 544 44
16 3 542 44
1 1 265 50
2 1 238 50
3 1 264 50
4 1 274 50
5 1 276 50
6 1 284 50
7 1 282 50
8 1 273 50
9 2 456 50
10 2 475 50
11 2 462 50
12 2 612 50
13 3 507 50
14 3 543 50
15 3 553 50
16 3 550 50
1 1 272 57
2 1 247 57
3 1 268 57
4 1 273 57
5 1 278 57
6 1 279 57
7 1 281 57
8 1 274 57
9 2 468 57
10 2 484 57
11 2 466 57
12 2 618 57
13 3 518 57
14 3 544 57
15 3 555 57
16 3 553 57
1 1 278 64
2 1 245 64
3 1 269 64
4 1 275 64
5 1 280 64
6 1 281 64
7 1 284 64
8 1 278 64
9 2 478 64
10 2 496 64
11 2 472 64
12 2 628 64
13 3 525 64
14 3 559 64
15 3 548 64
16 3 569 64

Let’s make it easier to understand what is going on by ploting the data

ggplot(RATSL, aes(x = Time, y = Weight, linetype = ID)) +
  geom_line() +
  scale_linetype_manual(values = rep(1:10, times=4)) +
  facet_grid(. ~ Group, labeller = label_both) +
  theme(legend.position = "none") + 
  scale_y_continuous(limits = c(min(RATSL$Weight), max(RATSL$Weight)))

The weight among groups seems to vary and there seems to be one noticeably larger rat in group 2.

It is easier to comapre differences if we standardize the data.

RATSL = RATSL %>%
  group_by(Time) %>%
  mutate(stdRATSL = (Weight - mean(Weight))/sd(Weight)) %>%
  ungroup()

ggplot(RATSL, aes(x = Time, y = stdRATSL, linetype = ID)) +
  geom_line() +
  scale_linetype_manual(values = rep(1:10, times=4)) +
  facet_grid(. ~ Group, labeller = label_both) +
  scale_y_continuous(name = "standardized weight")

Now we can see how much standard deviations the observations vary form the mean

RATSS <- RATSL %>%
  group_by(Group, Time) %>%
  summarise( mean = mean(Weight), se = (sd(Weight)/(sqrt(length(Weight)))) ) %>%
  ungroup()
## `summarise()` has grouped output by 'Group'. You can override using the
## `.groups` argument.
# Plot the mean profiles
library(ggplot2)
ggplot(RATSS, aes(x = Time, y = mean, linetype = Group, shape = Group)) +
  geom_line() +
  scale_linetype_manual(values = c(1,2,3)) +
  geom_point(size=3) +
  scale_shape_manual(values = c(1,2,4)) +
  geom_errorbar(aes(ymin=mean-se, ymax=mean+se, linetype="1", color=Group), width=0.5) +
  theme(legend.position = c(0.8,0.8)) +
  scale_y_continuous(name = "mean(bprs) +/- se(bprs)")

Mean weight of group 1 differs clearly from groups 2 and 3. The difference between 2 and 3 is a lot smaller as time increases, their standard error bars overlaps within each other.

Let’s draw box plots!

RATSBOX= RATSL %>%
  filter(Time > 1) %>%
  group_by(Group, ID) %>%
  summarise( mean=mean(Weight) ) %>%
  ungroup()
## `summarise()` has grouped output by 'Group'. You can override using the
## `.groups` argument.
ggplot(RATSBOX, aes(x = Group, y = mean)) +
  geom_boxplot() +
  stat_summary(fun = "mean", geom = "point", shape=23, size=4, fill = "white") +
  scale_y_continuous(name = "mean(Weight)")

There seems to be one outlier in each group. Now in general we should be careful when deleting outliers, as we might as well be deleting legitimate observations and thus causing bias. How ever as this is just a exercise to learn to use the tools we can discard them!

RATSBOX1=RATSBOX[-which(RATSBOX$Group==3 & RATSBOX$mean<500),]
RATSBOX1=RATSBOX1[-which(RATSBOX$Group==1 & RATSBOX$mean<250),]
RATSBOX1=RATSBOX1[-which(RATSBOX$Group==2 & RATSBOX$mean < 550),]

ggplot(RATSBOX1, aes(x = Group, y = mean)) +
  geom_boxplot() +
  stat_summary(fun = "mean", geom = "point", shape=23, size=4, fill = "white") +
  scale_y_continuous(name = "mean(Weight)")

Now there are no more outliers. After removing the outliers it would be time to perform a t-test. How ever t-test would require 2 gopups and we have 3. That is an issue! To solve this we will abondon the t-test and move to anova.

summary(aov(mean ~ Group, data = RATSBOX1))
##             Df Sum Sq Mean Sq F value   Pr(>F)    
## Group        2 160617   80308    3180 2.49e-12 ***
## Residuals    8    202      25                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

It seems that the groups are significantly different.

Now we can move on to modelling!

orig_rats=read.table("https://raw.githubusercontent.com/KimmoVehkalahti/MABS/master/Examples/data/rats.txt", sep ="\t", header = T)
RATS2 = RATSBOX %>%
  mutate(baseline = orig_rats$WD1)

# Fit the linear model with the mean as the response 
fit <- lm(mean ~ baseline+Group, data =RATS2)

# Compute the analysis of variance table for the fitted model with anova()
anova(fit)
## Analysis of Variance Table
## 
## Response: mean
##           Df Sum Sq Mean Sq   F value   Pr(>F)    
## baseline   1 253625  253625 1859.8201 1.57e-14 ***
## Group      2    879     439    3.2219  0.07586 .  
## Residuals 12   1636     136                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The mean differs significangtly from the baseline. The significance of Group is significant at 10 % confidence level but not on 95%.


Analysis of Longitudinal Data II: Linear Mixed Effects Models for Normal Response Variables

For this section we will use the BPRS-dataset. It is already transformed to long format during datawrangling stage and imported during the “Data” section. Let’s plot the data

library(ggplot2)
ggplot(BPRSL, aes(x = week, y = bprs, group = subject, color=subject)) +
  geom_line()

The data seems quite messy, but there it is. Let’s create a simple regression model!

BPRSL_reg <- lm(bprs ~ week + treatment, data = BPRSL)
summary(BPRSL_reg)
## 
## Call:
## lm(formula = bprs ~ week + treatment, data = BPRSL)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -22.454  -8.965  -3.196   7.002  50.244 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  46.4539     1.3670  33.982   <2e-16 ***
## week         -2.2704     0.2524  -8.995   <2e-16 ***
## treatment2    0.5722     1.3034   0.439    0.661    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.37 on 357 degrees of freedom
## Multiple R-squared:  0.1851, Adjusted R-squared:  0.1806 
## F-statistic: 40.55 on 2 and 357 DF,  p-value: < 2.2e-16

Week is significant where as treatment has no significant effect on bprs.

Let’s continue with Random Intercept Model.

library(lme4)
## Loading required package: Matrix
## Warning: package 'Matrix' was built under R version 4.2.3
## 
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
## 
##     expand, pack, unpack
## 
## Attaching package: 'lme4'
## The following object is masked from 'package:flexmix':
## 
##     refit
BPRS_ref = lmer(bprs ~ week + treatment + (1 | subject), data = BPRSL, REML = FALSE)

summary(BPRS_ref)
## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: bprs ~ week + treatment + (1 | subject)
##    Data: BPRSL
## 
##      AIC      BIC   logLik deviance df.resid 
##   2748.7   2768.1  -1369.4   2738.7      355 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.0481 -0.6749 -0.1361  0.4813  3.4855 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  subject  (Intercept)  47.41    6.885  
##  Residual             104.21   10.208  
## Number of obs: 360, groups:  subject, 20
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)  46.4539     1.9090  24.334
## week         -2.2704     0.2084 -10.896
## treatment2    0.5722     1.0761   0.532
## 
## Correlation of Fixed Effects:
##            (Intr) week  
## week       -0.437       
## treatment2 -0.282  0.000

Estimate for treatment 2 is a bit (0.6 units) higher for treatment 2 than for treatment 1

Next it is time for Random Intercept and Random Slope Model!

BPRS_ref1 <- lmer(bprs ~ week + treatment + (week | subject), data = BPRSL, REML = FALSE)
summary(BPRS_ref1)
## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: bprs ~ week + treatment + (week | subject)
##    Data: BPRSL
## 
##      AIC      BIC   logLik deviance df.resid 
##   2745.4   2772.6  -1365.7   2731.4      353 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -2.8919 -0.6194 -0.0691  0.5531  3.7976 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev. Corr 
##  subject  (Intercept) 64.8222  8.0512        
##           week         0.9609  0.9802   -0.51
##  Residual             97.4305  9.8707        
## Number of obs: 360, groups:  subject, 20
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)  46.4539     2.1052  22.066
## week         -2.2704     0.2977  -7.626
## treatment2    0.5722     1.0405   0.550
## 
## Correlation of Fixed Effects:
##            (Intr) week  
## week       -0.582       
## treatment2 -0.247  0.000

According to the information criteria (AIC and BIC) the models are not too different. Still the treatment 2 is about 0.6 units larger than observation on treatment 1.

We can compare the 2 previous models.

anova(BPRS_ref1, BPRS_ref)
## Data: BPRSL
## Models:
## BPRS_ref: bprs ~ week + treatment + (1 | subject)
## BPRS_ref1: bprs ~ week + treatment + (week | subject)
##           npar    AIC    BIC  logLik deviance  Chisq Df Pr(>Chisq)  
## BPRS_ref     5 2748.7 2768.1 -1369.4   2738.7                       
## BPRS_ref1    7 2745.4 2772.6 -1365.7   2731.4 7.2721  2    0.02636 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

As smaller chisq equals better fit, it sees that addition of “week” actually decreased the fit of a model.

Let’s add yet predictors to take passage of time into account.

BPRS_ref2 <- lmer(bprs ~ week + treatment + (week * treatment | subject), data = BPRSL, REML = FALSE)


summary(BPRS_ref2)
## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: bprs ~ week + treatment + (week * treatment | subject)
##    Data: BPRSL
## 
##      AIC      BIC   logLik deviance df.resid 
##   2525.8   2580.2  -1248.9   2497.8      346 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -2.4842 -0.5448 -0.0829  0.4364  3.7005 
## 
## Random effects:
##  Groups   Name            Variance Std.Dev. Corr             
##  subject  (Intercept)     178.447  13.358                    
##           week              2.924   1.710   -0.86            
##           treatment2      379.364  19.477   -0.76  0.76      
##           week:treatment2   3.457   1.859    0.68 -0.73 -0.82
##  Residual                  36.748   6.062                    
## Number of obs: 360, groups:  subject, 20
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)  44.9472     2.3816  18.872
## week         -2.1809     0.2922  -7.464
## treatment2    2.9726     2.7748   1.071
## 
## Correlation of Fixed Effects:
##            (Intr) week  
## week       -0.728       
## treatment2 -0.526  0.377
anova(BPRS_ref2, BPRS_ref1)
## Data: BPRSL
## Models:
## BPRS_ref1: bprs ~ week + treatment + (week | subject)
## BPRS_ref2: bprs ~ week + treatment + (week * treatment | subject)
##           npar    AIC    BIC  logLik deviance  Chisq Df Pr(>Chisq)    
## BPRS_ref1    7 2745.4 2772.6 -1365.7   2731.4                         
## BPRS_ref2   14 2525.8 2580.2 -1248.9   2497.8 233.63  7  < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

According to the information criterias we now created a better model and models are significantly different.

Let’s plot the fitted values!

Fitted <- fitted(BPRS_ref2)
mutate(BPRSL, Fitted)
##     treatment subject weeks bprs week   Fitted
## 1           1       1 week0   42    0 39.53023
## 2           1       2 week0   58    0 64.55924
## 3           1       3 week0   54    0 52.40843
## 4           1       4 week0   55    0 63.81002
## 5           1       5 week0   72    0 74.69734
## 6           1       6 week0   48    0 46.40994
## 7           1       7 week0   71    0 57.11848
## 8           1       8 week0   30    0 35.31448
## 9           1       9 week0   41    0 42.35846
## 10          1      10 week0   57    0 57.79629
## 11          1      11 week0   30    0 34.28622
## 12          1      12 week0   55    0 57.23696
## 13          1      13 week0   36    0 35.62519
## 14          1      14 week0   38    0 37.38952
## 15          1      15 week0   66    0 64.36611
## 16          1      16 week0   41    0 43.13426
## 17          1      17 week0   45    0 44.19203
## 18          1      18 week0   39    0 35.00070
## 19          1      19 week0   24    0 30.13647
## 20          1      20 week0   38    0 34.84925
## 21          2       1 week0   52    0 52.47253
## 22          2       2 week0   30    0 30.12844
## 23          2       3 week0   65    0 42.56787
## 24          2       4 week0   37    0 34.83445
## 25          2       5 week0   59    0 63.96044
## 26          2       6 week0   30    0 35.83590
## 27          2       7 week0   69    0 53.30648
## 28          2       8 week0   62    0 53.80063
## 29          2       9 week0   38    0 39.67465
## 30          2      10 week0   65    0 47.66991
## 31          2      11 week0   78    0 81.21145
## 32          2      12 week0   38    0 39.05351
## 33          2      13 week0   63    0 63.37518
## 34          2      14 week0   40    0 39.63652
## 35          2      15 week0   40    0 47.30124
## 36          2      16 week0   54    0 44.02844
## 37          2      17 week0   33    0 37.64772
## 38          2      18 week0   28    0 31.93695
## 39          2      19 week0   52    0 41.57177
## 40          2      20 week0   47    0 39.36629
## 41          1       1 week1   36    1 39.78987
## 42          1       2 week1   68    1 59.57317
## 43          1       3 week1   55    1 48.72009
## 44          1       4 week1   77    1 60.72734
## 45          1       5 week1   75    1 69.11971
## 46          1       6 week1   43    1 43.71205
## 47          1       7 week1   61    1 53.09874
## 48          1       8 week1   36    1 34.26542
## 49          1       9 week1   43    1 39.46305
## 50          1      10 week1   51    1 54.94666
## 51          1      11 week1   34    1 34.60202
## 52          1      12 week1   52    1 54.00613
## 53          1      13 week1   32    1 33.77357
## 54          1      14 week1   35    1 35.68504
## 55          1      15 week1   68    1 59.65388
## 56          1      16 week1   35    1 40.93511
## 57          1      17 week1   38    1 42.21711
## 58          1      18 week1   35    1 33.32487
## 59          1      19 week1   28    1 29.03702
## 60          1      20 week1   34    1 32.87497
## 61          2       1 week1   73    1 51.59396
## 62          2       2 week1   23    1 28.39425
## 63          2       3 week1   31    1 40.02322
## 64          2       4 week1   31    1 33.37756
## 65          2       5 week1   67    1 60.05537
## 66          2       6 week1   33    1 34.11580
## 67          2       7 week1   52    1 50.35266
## 68          2       8 week1   54    1 53.79601
## 69          2       9 week1   40    1 37.07948
## 70          2      10 week1   44    1 46.00169
## 71          2      11 week1   95    1 78.82380
## 72          2      12 week1   41    1 36.72244
## 73          2      13 week1   65    1 59.70320
## 74          2      14 week1   37    1 38.12660
## 75          2      15 week1   36    1 45.16243
## 76          2      16 week1   45    1 40.72771
## 77          2      17 week1   41    1 38.03655
## 78          2      18 week1   30    1 31.62848
## 79          2      19 week1   43    1 38.69472
## 80          2      20 week1   36    1 36.84157
## 81          1       1 week2   36    2 40.04952
## 82          1       2 week2   61    2 54.58710
## 83          1       3 week2   41    2 45.03176
## 84          1       4 week2   49    2 57.64466
## 85          1       5 week2   72    2 63.54207
## 86          1       6 week2   41    2 41.01417
## 87          1       7 week2   47    2 49.07901
## 88          1       8 week2   38    2 33.21635
## 89          1       9 week2   39    2 36.56765
## 90          1      10 week2   51    2 52.09703
## 91          1      11 week2   34    2 34.91782
## 92          1      12 week2   49    2 50.77529
## 93          1      13 week2   36    2 31.92196
## 94          1      14 week2   36    2 33.98057
## 95          1      15 week2   65    2 54.94165
## 96          1      16 week2   45    2 38.73595
## 97          1      17 week2   46    2 40.24218
## 98          1      18 week2   27    2 31.64904
## 99          1      19 week2   31    2 27.93756
## 100         1      20 week2   27    2 30.90070
## 101         2       1 week2   42    2 50.71539
## 102         2       2 week2   32    2 26.66006
## 103         2       3 week2   33    2 37.47856
## 104         2       4 week2   27    2 31.92068
## 105         2       5 week2   58    2 56.15031
## 106         2       6 week2   37    2 32.39571
## 107         2       7 week2   41    2 47.39883
## 108         2       8 week2   49    2 53.79139
## 109         2       9 week2   38    2 34.48432
## 110         2      10 week2   31    2 44.33347
## 111         2      11 week2   75    2 76.43614
## 112         2      12 week2   36    2 34.39137
## 113         2      13 week2   60    2 56.03122
## 114         2      14 week2   31    2 36.61669
## 115         2      15 week2   55    2 43.02361
## 116         2      16 week2   35    2 37.42697
## 117         2      17 week2   30    2 38.42537
## 118         2      18 week2   29    2 31.32002
## 119         2      19 week2   26    2 35.81766
## 120         2      20 week2   32    2 34.31685
## 121         1       1 week3   43    3 40.30917
## 122         1       2 week3   55    3 49.60103
## 123         1       3 week3   38    3 41.34342
## 124         1       4 week3   54    3 54.56197
## 125         1       5 week3   65    3 57.96444
## 126         1       6 week3   38    3 38.31628
## 127         1       7 week3   30    3 45.05927
## 128         1       8 week3   38    3 32.16729
## 129         1       9 week3   35    3 33.67224
## 130         1      10 week3   55    3 49.24740
## 131         1      11 week3   41    3 35.23362
## 132         1      12 week3   54    3 47.54445
## 133         1      13 week3   31    3 30.07034
## 134         1      14 week3   34    3 32.27609
## 135         1      15 week3   49    3 50.22941
## 136         1      16 week3   42    3 36.53680
## 137         1      17 week3   38    3 38.26726
## 138         1      18 week3   25    3 29.97321
## 139         1      19 week3   28    3 26.83811
## 140         1      20 week3   25    3 28.92642
## 141         2       1 week3   41    3 49.83682
## 142         2       2 week3   24    3 24.92588
## 143         2       3 week3   28    3 34.93390
## 144         2       4 week3   31    3 30.46379
## 145         2       5 week3   61    3 52.24525
## 146         2       6 week3   33    3 30.67561
## 147         2       7 week3   33    3 44.44500
## 148         2       8 week3   39    3 53.78677
## 149         2       9 week3   27    3 31.88916
## 150         2      10 week3   34    3 42.66526
## 151         2      11 week3   76    3 74.04848
## 152         2      12 week3   27    3 32.06030
## 153         2      13 week3   53    3 52.35925
## 154         2      14 week3   38    3 35.10678
## 155         2      15 week3   55    3 40.88479
## 156         2      16 week3   27    3 34.12624
## 157         2      17 week3   32    3 38.81420
## 158         2      18 week3   33    3 31.01156
## 159         2      19 week3   27    3 32.94060
## 160         2      20 week3   29    3 31.79213
## 161         1       1 week4   41    4 40.56882
## 162         1       2 week4   43    4 44.61496
## 163         1       3 week4   43    4 37.65508
## 164         1       4 week4   56    4 51.47929
## 165         1       5 week4   50    4 52.38681
## 166         1       6 week4   36    4 35.61840
## 167         1       7 week4   27    4 41.03953
## 168         1       8 week4   31    4 31.11823
## 169         1       9 week4   28    4 30.77683
## 170         1      10 week4   53    4 46.39777
## 171         1      11 week4   36    4 35.54942
## 172         1      12 week4   48    4 44.31361
## 173         1      13 week4   25    4 28.21872
## 174         1      14 week4   25    4 30.57162
## 175         1      15 week4   36    4 45.51718
## 176         1      16 week4   31    4 34.33765
## 177         1      17 week4   40    4 36.29233
## 178         1      18 week4   29    4 28.29738
## 179         1      19 week4   29    4 25.73865
## 180         1      20 week4   25    4 26.95214
## 181         2       1 week4   39    4 48.95824
## 182         2       2 week4   20    4 23.19169
## 183         2       3 week4   22    4 32.38924
## 184         2       4 week4   31    4 29.00690
## 185         2       5 week4   49    4 48.34018
## 186         2       6 week4   28    4 28.95551
## 187         2       7 week4   34    4 41.49118
## 188         2       8 week4   55    4 53.78215
## 189         2       9 week4   31    4 29.29399
## 190         2      10 week4   39    4 40.99704
## 191         2      11 week4   66    4 71.66082
## 192         2      12 week4   29    4 29.72923
## 193         2      13 week4   52    4 48.68727
## 194         2      14 week4   35    4 33.59687
## 195         2      15 week4   42    4 38.74597
## 196         2      16 week4   25    4 30.82551
## 197         2      17 week4   46    4 39.20303
## 198         2      18 week4   30    4 30.70310
## 199         2      19 week4   24    4 30.06355
## 200         2      20 week4   25    4 29.26741
## 201         1       1 week5   40    5 40.82847
## 202         1       2 week5   34    5 39.62889
## 203         1       3 week5   28    5 33.96674
## 204         1       4 week5   50    5 48.39661
## 205         1       5 week5   39    5 46.80917
## 206         1       6 week5   29    5 32.92052
## 207         1       7 week5   40    5 37.01979
## 208         1       8 week5   26    5 30.06917
## 209         1       9 week5   22    5 27.88143
## 210         1      10 week5   43    5 43.54814
## 211         1      11 week5   36    5 35.86522
## 212         1      12 week5   43    5 41.08278
## 213         1      13 week5   25    5 26.36711
## 214         1      14 week5   27    5 28.86714
## 215         1      15 week5   32    5 40.80495
## 216         1      16 week5   31    5 32.13849
## 217         1      17 week5   33    5 34.31741
## 218         1      18 week5   28    5 26.62155
## 219         1      19 week5   21    5 24.63920
## 220         1      20 week5   27    5 24.97787
## 221         2       1 week5   38    5 48.07967
## 222         2       2 week5   20    5 21.45750
## 223         2       3 week5   25    5 29.84459
## 224         2       4 week5   26    5 27.55002
## 225         2       5 week5   38    5 44.43512
## 226         2       6 week5   26    5 27.23542
## 227         2       7 week5   37    5 38.53735
## 228         2       8 week5   51    5 53.77753
## 229         2       9 week5   24    5 26.69883
## 230         2      10 week5   34    5 39.32882
## 231         2      11 week5   64    5 69.27317
## 232         2      12 week5   27    5 27.39816
## 233         2      13 week5   32    5 45.01529
## 234         2      14 week5   30    5 32.08696
## 235         2      15 week5   30    5 36.60715
## 236         2      16 week5   22    5 27.52478
## 237         2      17 week5   43    5 39.59186
## 238         2      18 week5   26    5 30.39464
## 239         2      19 week5   32    5 27.18649
## 240         2      20 week5   23    5 26.74269
## 241         1       1 week6   38    6 41.08812
## 242         1       2 week6   28    6 34.64282
## 243         1       3 week6   29    6 30.27841
## 244         1       4 week6   47    6 45.31393
## 245         1       5 week6   32    6 41.23154
## 246         1       6 week6   33    6 30.22263
## 247         1       7 week6   30    6 33.00005
## 248         1       8 week6   26    6 29.02011
## 249         1       9 week6   20    6 24.98602
## 250         1      10 week6   43    6 40.69851
## 251         1      11 week6   38    6 36.18102
## 252         1      12 week6   37    6 37.85194
## 253         1      13 week6   21    6 24.51549
## 254         1      14 week6   25    6 27.16267
## 255         1      15 week6   27    6 36.09272
## 256         1      16 week6   29    6 29.93934
## 257         1      17 week6   27    6 32.34248
## 258         1      18 week6   21    6 24.94572
## 259         1      19 week6   22    6 23.53975
## 260         1      20 week6   21    6 23.00359
## 261         2       1 week6   43    6 47.20110
## 262         2       2 week6   19    6 19.72331
## 263         2       3 week6   24    6 27.29993
## 264         2       4 week6   24    6 26.09313
## 265         2       5 week6   37    6 40.53006
## 266         2       6 week6   27    6 25.51532
## 267         2       7 week6   37    6 35.58352
## 268         2       8 week6   55    6 53.77292
## 269         2       9 week6   22    6 24.10367
## 270         2      10 week6   41    6 37.66060
## 271         2      11 week6   64    6 66.88551
## 272         2      12 week6   21    6 25.06709
## 273         2      13 week6   37    6 41.34332
## 274         2      14 week6   33    6 30.57705
## 275         2      15 week6   26    6 34.46834
## 276         2      16 week6   22    6 24.22404
## 277         2      17 week6   43    6 39.98068
## 278         2      18 week6   36    6 30.08618
## 279         2      19 week6   21    6 24.30943
## 280         2      20 week6   23    6 24.21797
## 281         1       1 week7   47    7 41.34777
## 282         1       2 week7   28    7 29.65674
## 283         1       3 week7   25    7 26.59007
## 284         1       4 week7   42    7 42.23125
## 285         1       5 week7   38    7 35.65391
## 286         1       6 week7   27    7 27.52475
## 287         1       7 week7   31    7 28.98032
## 288         1       8 week7   25    7 27.97104
## 289         1       9 week7   23    7 22.09061
## 290         1      10 week7   39    7 37.84888
## 291         1      11 week7   36    7 36.49682
## 292         1      12 week7   36    7 34.62110
## 293         1      13 week7   19    7 22.66388
## 294         1      14 week7   26    7 25.45819
## 295         1      15 week7   30    7 31.38048
## 296         1      16 week7   26    7 27.74019
## 297         1      17 week7   31    7 30.36756
## 298         1      18 week7   25    7 23.26988
## 299         1      19 week7   23    7 22.44029
## 300         1      20 week7   19    7 21.02931
## 301         2       1 week7   62    7 46.32252
## 302         2       2 week7   18    7 17.98912
## 303         2       3 week7   31    7 24.75527
## 304         2       4 week7   26    7 24.63625
## 305         2       5 week7   36    7 36.62499
## 306         2       6 week7   23    7 23.79522
## 307         2       7 week7   38    7 32.62969
## 308         2       8 week7   59    7 53.76830
## 309         2       9 week7   21    7 21.50850
## 310         2      10 week7   42    7 35.99238
## 311         2      11 week7   60    7 64.49785
## 312         2      12 week7   22    7 22.73602
## 313         2      13 week7   52    7 37.67134
## 314         2      14 week7   30    7 29.06713
## 315         2      15 week7   30    7 32.32952
## 316         2      16 week7   22    7 20.92331
## 317         2      17 week7   43    7 40.36951
## 318         2      18 week7   33    7 29.77772
## 319         2      19 week7   21    7 21.43238
## 320         2      20 week7   23    7 21.69325
## 321         1       1 week8   51    8 41.60742
## 322         1       2 week8   28    8 24.67067
## 323         1       3 week8   24    8 22.90173
## 324         1       4 week8   46    8 39.14856
## 325         1       5 week8   32    8 30.07628
## 326         1       6 week8   25    8 24.82686
## 327         1       7 week8   31    8 24.96058
## 328         1       8 week8   24    8 26.92198
## 329         1       9 week8   21    8 19.19521
## 330         1      10 week8   32    8 34.99925
## 331         1      11 week8   36    8 36.81262
## 332         1      12 week8   31    8 31.39026
## 333         1      13 week8   22    8 20.81226
## 334         1      14 week8   26    8 23.75372
## 335         1      15 week8   37    8 26.66825
## 336         1      16 week8   30    8 25.54103
## 337         1      17 week8   27    8 28.39263
## 338         1      18 week8   20    8 21.59405
## 339         1      19 week8   22    8 21.34084
## 340         1      20 week8   21    8 19.05504
## 341         2       1 week8   50    8 45.44395
## 342         2       2 week8   20    8 16.25493
## 343         2       3 week8   32    8 22.21061
## 344         2       4 week8   23    8 23.17936
## 345         2       5 week8   35    8 32.71993
## 346         2       6 week8   21    8 22.07513
## 347         2       7 week8   35    8 29.67587
## 348         2       8 week8   66    8 53.76368
## 349         2       9 week8   21    8 18.91334
## 350         2      10 week8   39    8 34.32416
## 351         2      11 week8   75    8 62.11019
## 352         2      12 week8   23    8 20.40495
## 353         2      13 week8   28    8 33.99936
## 354         2      14 week8   27    8 27.55722
## 355         2      15 week8   37    8 30.19070
## 356         2      16 week8   22    8 17.62258
## 357         2      17 week8   43    8 40.75834
## 358         2      18 week8   30    8 29.46925
## 359         2      19 week8   21    8 18.55532
## 360         2      20 week8   23    8 19.16853
library(ggplot2)
ggplot(BPRSL, aes(x = week, y = Fitted, treatment = subject)) +
  geom_line(aes(linetype = treatment)) +
  scale_x_continuous(name = "Week", breaks = seq(0, max(BPRSL$week), 1)) +
  scale_y_continuous(name = "Fitted BPRS") +
  theme(legend.position = "top")

comparing fitted and original values we can see the data is a lot easier to understand. Visually it seems that with treatment 1, a smaller BPRS is achieved. …